{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8b715375",
   "metadata": {},
   "source": [
    "# Top-K Similarity Search - Ask A Book A Question\n",
    "\n",
    "In this tutorial we will see a simple example of basic retrieval via Top-K Similarity search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9d615a77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install langchain --upgrade\n",
    "# Version: 0.0.164\n",
    "\n",
    "# !pip install pypdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2d3e92ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader\n",
    "from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, TextLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from dotenv import load_dotenv\n",
    "import os\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5166d759",
   "metadata": {},
   "source": [
    "### Load your data\n",
    "\n",
    "Next let's load up some data. I've put a few 'loaders' on there which will load data from different locations. Feel free to use the one that suits you. The default one queries one of Paul Graham's essays for a simple example. This process will only stage the loader, not actually load it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b4a2d6bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = TextLoader(file_path=\"../data/PaulGrahamEssays/vb.txt\")\n",
    "\n",
    "## Other options for loaders \n",
    "# loader = PyPDFLoader(\"../data/field-guide-to-data-science.pdf\")\n",
    "# loader = UnstructuredPDFLoader(\"../data/field-guide-to-data-science.pdf\")\n",
    "# loader = OnlinePDFLoader(\"https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8d38044",
   "metadata": {},
   "source": [
    "Then let's go ahead and actually load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bcdac23c",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e07a744a",
   "metadata": {},
   "source": [
    "Then let's actually check out what's been loaded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b4fd7c9e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You have 1 document(s) in your data\n",
      "There are 9155 characters in your sample document\n",
      "Here is a sample: January 2016Life is short, as everyone knows. When I was a kid I used to wonder\n",
      "about this. Is life actually short, or are we really complaining\n",
      "about its finiteness?  Would we be just as likely to fe\n"
     ]
    }
   ],
   "source": [
    "# Note: If you're using PyPDFLoader then it will split by page for you already\n",
    "print (f'You have {len(data)} document(s) in your data')\n",
    "print (f'There are {len(data[0].page_content)} characters in your sample document')\n",
    "print (f'Here is a sample: {data[0].page_content[:200]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8af9b604",
   "metadata": {},
   "source": [
    "### Chunk your data up into smaller documents\n",
    "\n",
    "While we could pass the entire essay to a model w/ long context, we want to be picky about which information we share with our model. The better signal to noise ratio we have the more likely we are to get the right answer.\n",
    "\n",
    "The first thing we'll do is chunk up our document into smaller pieces. The goal will be to take only a few of those smaller pieces and pass them to the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fb3c6f02",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll split our data into chunks around 500 characters each with a 50 character overlap. These are relatively small.\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)\n",
    "texts = text_splitter.split_documents(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "879873a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Now you have 20 documents\n"
     ]
    }
   ],
   "source": [
    "# Let's see how many small chunks we have\n",
    "print (f'Now you have {len(texts)} documents')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "838b2843",
   "metadata": {},
   "source": [
    "### Create embeddings of your documents to get ready for semantic search\n",
    "\n",
    "Next up we need to prepare for similarity searches. The way we do this is through embedding our documents (getting a vector per document).\n",
    "\n",
    "This will help us compare documents later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "373e695a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/homebrew/lib/python3.11/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
      "  from tqdm.autonotebook import tqdm\n"
     ]
    }
   ],
   "source": [
    "from langchain.vectorstores import Chroma, Pinecone\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "import pinecone"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "884e7857",
   "metadata": {},
   "source": [
    "Check to see if there is an environment variable with you API keys, if not, use what you put below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "42a1d5c3",
   "metadata": {
    "hide_input": false
   },
   "outputs": [],
   "source": [
    "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'YourAPIKey')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3205993a",
   "metadata": {},
   "source": [
    "Then we'll get our embeddings engine going. You can use whatever embeddings engine you would like. We'll use OpenAI's ada today."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b4619d3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76d66c06",
   "metadata": {},
   "source": [
    "### Option #1: Chroma (for local)\n",
    "\n",
    "I like Chroma becauase it's local and easy to set up without an account.\n",
    "\n",
    "First we'll pass our texts to Chroma via `.from_documents`, this will 1) embed the documents and get a vector, then 2) add them to the vectorstore for retrieval later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "4e0d1c6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load it into Chroma\n",
    "vectorstore = Chroma.from_documents(texts, embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d4750ab",
   "metadata": {},
   "source": [
    "Let's test it out. I want to see which documents are most closely related to a query.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "34929595",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is great about having kids?\"\n",
    "docs = vectorstore.similarity_search(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ec60de1",
   "metadata": {},
   "source": [
    "Then we can check them out. In theory, the texts which are deemed most similar should hold the answer to our question.\n",
    "But keep in mind that our query just happens to be a question, it could be a random statement or sentence and it would still work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "4e0f5b45",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "jabs into your consciousness like a pin.The things that matter aren't necessarily the ones people would\n",
      "call \"important.\"  Having coffee with a friend matters.  You won't\n",
      "feel later like that was a waste of time.One great thing about having small children is that they make you\n",
      "spend time on things that matter: them. They grab your sleeve as\n",
      "you're staring at your phone and say \"will you play with me?\" And\n",
      "\n",
      "the question, and the answer is that life actually is short.Having kids showed me how to convert a continuous quantity, time,\n",
      "into discrete quantities. You only get 52 weekends with your 2 year\n",
      "old.  If Christmas-as-magic lasts from say ages 3 to 10, you only\n",
      "get to watch your child experience it 8 times.  And while it's\n",
      "impossible to say what is a lot or a little of a continuous quantity\n",
      "like time, 8 is not a lot of something.  If you had a handful of 8\n",
      "\n",
      "January 2016Life is short, as everyone knows. When I was a kid I used to wonder\n",
      "about this. Is life actually short, or are we really complaining\n",
      "about its finiteness?  Would we be just as likely to feel life was\n",
      "short if we lived 10 times as long?Since there didn't seem any way to answer this question, I stopped\n",
      "wondering about it.  Then I had kids.  That gave me a way to answer\n",
      "\n",
      "done that we didn't.  My oldest son will be 7 soon.  And while I\n",
      "miss the 3 year old version of him, I at least don't have any regrets\n",
      "over what might have been.  We had the best time a daddy and a 3\n",
      "year old ever had.Relentlessly prune bullshit, don't wait to do things that matter,\n",
      "and savor the time you have.  That's what you do when life is short.Notes[1]\n",
      "At first I didn't like it that the word that came to mind was\n",
      "one that had other meanings.  But then I realized the other meanings\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Here's an example of the first document that was returned\n",
    "for doc in docs:\n",
    "    print (f\"{doc.page_content}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b73d8504",
   "metadata": {},
   "source": [
    "### Option #2: Pinecone (for cloud)\n",
    "If you want to use pinecone, run the code below, if not then skip over to Chroma below it. You must go to [Pinecone.io](https://www.pinecone.io/) and set up an account"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0e093ef3",
   "metadata": {
    "hide_input": false
   },
   "outputs": [],
   "source": [
    "# PINECONE_API_KEY = os.getenv('PINECONE_API_KEY', 'YourAPIKey')\n",
    "# PINECONE_API_ENV = os.getenv('PINECONE_API_ENV', 'us-east1-gcp') # You may need to switch with your env\n",
    "\n",
    "# # initialize pinecone\n",
    "# pinecone.init(\n",
    "#     api_key=PINECONE_API_KEY,  # find at app.pinecone.io\n",
    "#     environment=PINECONE_API_ENV  # next to api key in console\n",
    "# )\n",
    "# index_name = \"langchaintest\" # put in the name of your pinecone index here\n",
    "\n",
    "# docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c35dcd9",
   "metadata": {},
   "source": [
    "### Query those docs to get your answer back\n",
    "\n",
    "Great, those are just the docs which should hold our answer. Now we can pass those to a LangChain chain to query the LLM.\n",
    "\n",
    "We could do this manually, but a chain is a convenient helper for us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "f051337b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.chains.question_answering import load_qa_chain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "6b9b1c03",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)\n",
    "chain = load_qa_chain(llm, chain_type=\"stuff\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "f67ea7c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is great about having kids?\"\n",
    "docs = vectorstore.similarity_search(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3dfd2b7d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'One great thing about having kids is that they make you spend time on things that matter. They remind you to prioritize the important things in life, like spending quality time with them. Having kids can also bring a sense of joy and fulfillment as you watch them grow and experience new things.'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.run(input_documents=docs, question=query)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "516e0a80",
   "metadata": {},
   "source": [
    "Awesome! We just went and queried an external data source!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
